{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LlamaParse Agent\n",
    "\n",
    "This demo walks through using an OpenAI Agent with [LlamaParse](https://cloud.llamaindex.ai)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-parse llama-index llama-index-postprocessor-sbert-rerank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-...\"\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import Settings\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\")\n",
    "Settings.llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing \n",
    "\n",
    "For parsing, lets use a [recent paper](https://huggingface.co/papers/2403.09611) on Multi-Modal pretraining"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://arxiv.org/pdf/2403.09611.pdf -O paper.pdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, we can tell the parser to skip content we don't want. In this case, the references section will just add noise to a RAG system."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_parse import LlamaParse\n",
    "\n",
    "parser = LlamaParse(\n",
    "    result_type=\"markdown\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id 81251f39-01be-434e-99e8-1c1b83b82098\n"
     ]
    }
   ],
   "source": [
    "documents = await parser.aload_data(\"paper.pdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embeddings have been explicitly disabled. Using MockEmbedding.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "41it [00:00, 26765.21it/s]\n",
      "100%|██████████| 41/41 [00:13<00:00,  2.98it/s]\n"
     ]
    }
   ],
   "source": [
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "\n",
    "from llama_index.core.node_parser import MarkdownElementNodeParser, SentenceSplitter\n",
    "\n",
    "# explicitly extract tables with the MarkdownElementNodeParser\n",
    "node_parser = MarkdownElementNodeParser(num_workers=8)\n",
    "nodes = node_parser.get_nodes_from_documents(documents)\n",
    "nodes, objects = node_parser.get_nodes_and_objects(nodes)\n",
    "\n",
    "# Chain splitters to ensure chunk size requirements are met\n",
    "nodes = SentenceSplitter(chunk_size=512, chunk_overlap=20).get_nodes_from_documents(nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chat over the paper, lets find out what it is about!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex, SummaryIndex\n",
    "\n",
    "vector_index = VectorStoreIndex(nodes=nodes)\n",
    "summary_index = SummaryIndex(nodes=nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.agent.openai import OpenAIAgent\n",
    "from llama_index.core.tools import QueryEngineTool, ToolMetadata\n",
    "from llama_index.postprocessor.colbert_rerank import ColbertRerank\n",
    "\n",
    "tools = [\n",
    "    QueryEngineTool(\n",
    "        vector_index.as_query_engine(\n",
    "            similarity_top_k=8,\n",
    "            node_postprocessors=[ColbertRerank(top_n=3)]\n",
    "        ),\n",
    "        metadata=ToolMetadata(\n",
    "            name=\"search\",\n",
    "            description=\"Search the document, pass the entire user message in the query\",\n",
    "        ),\n",
    "    ),\n",
    "    QueryEngineTool(\n",
    "        summary_index.as_query_engine(),\n",
    "        metadata=ToolMetadata(\n",
    "            name=\"summarize\",\n",
    "            description=\"Summarize the document using the user message\",\n",
    "        ),\n",
    "    ),\n",
    "]\n",
    "\n",
    "agent = OpenAIAgent.from_tools(\n",
    "    tools=tools, \n",
    "    verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Added user message to memory: What is the summary of the paper?\n",
      "=== Calling Function ===\n",
      "Calling function: summarize with args: {\"input\":\"summary\"}\n",
      "Got output: The research focuses on developing Multimodal Large Language Models (MLLMs) by incorporating image-caption, interleaved image-text, and text-only data for pre-training. It highlights the importance of factors like the image encoder, resolution, and token count, while downplaying the design of the vision-language connector. With models scaling up to 30B parameters, the MM1 family demonstrates impressive performance in pre-training metrics and competitive outcomes on diverse multimodal benchmarks. It demonstrates abilities such as in-context learning and multi-image reasoning, aiming to provide valuable insights for creating MLLMs that benefit the research community.\n",
      "========================\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# note -- this will take a while with local LLMs, its sending every node in the document to the LLM\n",
    "resp = agent.chat(\"What is the summary of the paper?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The summary of the paper highlights the development of Multimodal Large Language Models (MLLMs) by incorporating image-caption, interleaved image-text, and text-only data for pre-training. The research emphasizes factors like the image encoder, resolution, and token count, while de-emphasizing the design of the vision-language connector. The MM1 family of models, scaling up to 30B parameters, shows impressive performance in pre-training metrics and competitive outcomes on various multimodal benchmarks. These models demonstrate capabilities such as in-context learning and multi-image reasoning, aiming to provide valuable insights for creating MLLMs that benefit the research community.\n"
     ]
    }
   ],
   "source": [
    "print(str(resp))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Added user message to memory: How do the authors evaluate their work?\n",
      "=== Calling Function ===\n",
      "Calling function: search with args: {\"input\":\"evaluation methods\"}\n",
      "Got output: The evaluation methods involve synthesizing all benchmark results into a single meta-average number to simplify comparisons. This is achieved by normalizing the evaluation metrics with respect to a baseline configuration, standardizing the results for each task, adjusting every metric by dividing it by its respective baseline, and then averaging across all metrics.\n",
      "========================\n",
      "\n"
     ]
    }
   ],
   "source": [
    "resp = agent.chat(\"How do the authors evaluate their work?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The authors evaluate their work by synthesizing all benchmark results into a single meta-average number to simplify comparisons. They normalize the evaluation metrics with respect to a baseline configuration, standardize the results for each task, adjust every metric by dividing it by its respective baseline, and then average across all metrics for evaluation.\n"
     ]
    }
   ],
   "source": [
    "print(str(resp))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-parse-aNC435Vv-py3.10",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
